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Abstract 

Recent technological advances have led to a flood of new data 
on cosmology rich in information about the formation and evolution 
of the universe, e.g., the data collected in Sloan Digital Sky Survey 
(SDSS) for more than 200 million objects. The analyses of such data 
demand cutting edge statistical technologies. Here, we have used the 
concept of mixture model within Bayesian semiparametric methodol- 
ogy to fit the regression curve with the bivariate data for the apparent 
magnitude and rcdshift for Quasars in SDSS (2007) catalogue. As- 
sociated with the mixture modeling is a highly efficient curve-fitting 
procedure, which is central to the application considered in this paper. 
Moreover, we adopt a new method for analysing the posterior distribu- 
tion of clusterings, also generated as a by-product of our methodology. 
The results of our analysis of the cosmological data clearly indicate 
the existence of four change points on the regression curve and poss- 
sibiltiy of clustering of quasars specially at high redshift. This sheds 
new light not only on the issue of evolution, existence of acceleration 
or decceleration and environment around quasars at high redshift but 
also help us to estimate the cosmological parameters related to accel- 
eration or decceleration. 

Keywords: Cluster analysis; Cosmology; Dirichlet process; Model 
validation; Markov chain Monte Carlo; Non-linear regression. 

1 Introduction 

Among many different ways of testing models of cosmological sources, espe- 
cially quasars, one is through the investigations of the distributions, ranges 
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and more importantly the correlations among the relevant physical character- 
istics, such as luminosity, spectra, rcdshifts or cosmological distances. The 
impossibility of direct measurements to quasi stellar objects prevents one 
to validate any direct relationship between distance and redshift where, the 
measurable quantities are the apparent magnitude (m), redshift (z) which, as 
related to luminosity function or, even the probability distribution of absolute 
magnitude (M), as predicted by a given cosmology and the angular diameter 
of the object. The recent increase of the computing power led the theorists 
to simulate the realistic physical situations in specific details and predictions 
which are beyond the present limit of experimental techniques. For example, 
in the case of n-body simulations, it is possible to answer given certain initial 
conditions at some time ti and some assumed laws of physics, what will be the 
state of the system at later time t2 Hockney(1988) ? However, it is harder to 
solve the inverse problem Aster (2004): given all of the data, what can be said 
about the laws of physics that have been operating Brewer(2008)? Various 
cosmological models are considered to understand the formation and evolu- 
tion of the universe. In cosmology, for a given set of data, there exists many 
possible explanations. A typical observation may rule out some theories but 
may be consistent with some others. Again, many specific techniques have 
been constructed to tackle each inverse problem seperately. It is worth men- 
tioning that Efron(1992), Efron(1999) considered different types of statistical 
arguments and tests on the truncated data gathered by astronomers to ex- 
tract important statistical characteristics. One of the present authors (SR) 
along with his collaborators Roy(2007) used the non-parametric methods of 
Efron(1992) and Efron(1999) to study the bivariate distribution of two phys- 
ically important parameters i.e. redshift and apparent magnitude observed 
in SDSS quasar survey (2005). The data is truncated in nature. However, 
the data in SDSS quasar survey of 2007 is no longer truncated. Here, we 
will discuss a general framework using concepts of Bayesian mixture mod- 
els and Dirichlet process to study the existence of clusterings in the quasar 
sample for the whole range of redshifts and the changepoints associated to 
the non-linearity of the fitting curve. The existence of non-linearity can be 
associated with the certain physical factors like evolution and presence of 
different environments around quasars at high redshift. This will shed new 
light on the present cosmological debates regarding the concordant redshift, 
age of the universe and acceleration/deceleration parameters. 

On the statistical methodological side, we adopt a very fast and efficient 
method for learning about posteriors associated with Bayesian mixture mod- 
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els with unknown number of components. In particular, we adopt a semi- 
paramctric Bayesian curve-fitting procedure based on our mixture model. 
We demonstrate that our methods are particularly suitable for application in 
massive data, as our present cosmological data. We also adopt a methodology 
for analysing the posterior distribution of the clusterings associated with our 
Bayesian mixture model with unknown number of components. Our methods 
are broadly based on the works of Bhattacharya(2008) and Bhattacharya et 
al(2008); however, important extensions to modeling multivariate data are 
described here. Perhaps more importantly, we demonstrate in this paper 
that cutting-edge research works of great scientific importance are possible 
with our methodologies, despite the enormity of the size of the data sets. To 
our knowledge, such advancement in the Bayesian semiparametric/clustering 
paradigm has not been possible before. 

The rest of the paper is structured as follows. In Section 2 we explain 
the data and in this connection, provide a brief overview of Bayesian mixture 
models with unknown number of components. Our Bayesian semiparametric 
curve fitting method in (massive) data sets consisting of multivariate obser- 
vations is introduced in Section 3. A Gibbs sampling algorithm to simulate 
from the associated posterior distributions is derived in Section 4. In Section 
5 we illustrate our curve-fitting methodology with a simulated data set, and 
application to the real cosmological data is considered in Section 6. Discus- 
sion on summarization of the posterior distribution of clusterings is provided 
in Section 7, and application of the clustering ideas to the real cosmological 
data set is considered in Section 8. The implications of our analysis of the 
cosmological data set, and related future work, are enlisted in Section 9. 

2 The data and overview of mixture models 

Our massive cosmological data set, consists of 96307 data points on loga- 
rithm of redshift (z) and apparent magnitude (m) for Qsasars (qsasi-stellar 
objects) collected from SDSS data. The data set does not reveal any clear- 
cut parametric relationship between the two variables of interest; moreover, 
our exploratory analyses clearly indicated that the (bivariate) normality as- 
sumption does not hold for the data. Indeed, our quantile-quantile plots of 
each of the two variables showed that the marginal distributions of both the 
variables are far from univariate normal. 

To resolve this problem we will use idea of mixture models, which are 
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noted for their flexibility. Indeed, as noted by Dalal(1983) and Diaconis(1985), 
mixture models composed of standard densities can, in principle, approx- 
imate any underlying distribution. For more on mixture models, see Tit- 
terington(1985), McLaclilan(1988). However, a technical problem associated 
with classical analysis of mixture models is associated with the number of 
mixture components included in the model. In the classical statistical liter- 
ature there does not seem to exist any rigorously procedme of selecting an 
adequate number of mixture components. On the other hand, the Bayesian 
paradigm offers elegant solutions to this problem. Among the contributions of 
Bayesians in this topic, notable are those of Escobar(1995) (henceforth, EW) 
and Richardson(1997) (henceforth, RG). The former use Dirichlet process to 
indirectly induce (random) variability in the number of components, while 
the approach of RG directly acknowledges uncertainty about the number of 
components and puts a prior distribution on the same, thus rendering the 
problem variable-dimensional. The methodology of RG relies on reversible 
jump Markov chain Monte Garlo (RJMGMG) Green(1995) for drawing in- 
ference. 

However, it is important to note that the RJMCMC method proposed by 
RG is quite complicated, and is error prone. But of more concern is the fact 
that their methodology is extremely sensitive to the "move types" selected, 
and since there are no general guidelines for selecting optimum move types, 
the algorithm could be very inefficient. Moreover, for variable-dimensional 
problems diagnosis of convergence of RJMCMC is a serious problem. The 
aforementioned problems asociated with the RJMCMC method are of course 
many times aggravated for multivariate observations. The methodology of 
EW is not a variable dimensional problem and straightforward Gibbs sam- 
pling methods are available, however, the number of parameters increases 
with data size, making Gibbs sampling (or any other sampling methods) 
infeasible for massive data sets. In response to this computational chal- 
lenge Wang(2008) have proposed the sequential updating and greedy search 
(SUGS) algorithm which proceeds by cyhng through the data points, sequen- 
tially allocating them to the cluster that maximizes the conditional posterior 
allocation probability. The conditional distribution of the unknown param- 
eter, which admits a closed form expression given the maximizing cluster, 
is then updated. A complete sweep of the algorithm yields the conditional 
posterior distribution of all the parameters, given the seuqentially optimal 
clusterings. The advantage of the method of Wang(2008) is that it is quite 
fast, since it does not rely upon MCMC methods. The disadvantages are 
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that the method does not have a theoretical basis, in that the correct joint 
or marginal posterior distributions of the parameters or clusterings are not 
obtained. Moreover, although the algorithm produces a sequentially optimal 
clustering, it does not yield a global maximum a posteriori (MAP) estimate. 
The algorithm depends upon the the order in which we consider the obser- 
vations. In case of large data set this problem is tackled by considering a 
few random ordering of the observations and then using pseudo-marginal 
likelihood (PML), which makes this method an ad hoc one. Perhaps more 
critically, the algorithm does not assist in any way in obtaining and studying 
the probability distribution of the clusterings. 

We avoid all the difficulties noted above by adopting a model which may 
be viewed as a reconciliation of the methods of EW and RG. The details are 
outlined next. 



3 Direct Bayesian mixture modeling of mul- 
tivariate observations using Dirichlet pro- 
cess and associated Bayesian curve fitting 

The cosmological data set of our interest is, as already noted above, consists 
of bivariate observations. As a result, the model and the methodologies 
proposed by Bhattacharya(2008) warrants extension to bivariate, in fact, 
more generally, to multivariate situations. For the sake of full generality, 
we extend the proposals of Bhattacharya(2008) to the case of d-dimensional 
observations, where d > 1. 

We assume for i = 1, . . . , n, data set Y = {y^^, . . . , y„} is available, where 
observation yj can be modeled as a mixture of (i-variate normal distributions, 
having p components. Crucially, p is assumed to be unknown. Rather than 
assuming a prior distribution on p like RG and treating the problem as vari- 
able dimensional, we assume the following form of mixture representation of 
the (i-variate observation y^: 

1 ^ lA 1^ r 1 1 

In the above, M(> p) is the maximum number of components the mixture 
can possibly have, and is known; @m = {Oi, . . . , 6m}, with 6j = {fij, Aj). 
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We further assume that ©m arc samples drawn from a Dirichlet process (see, 
for example, Ferguson(1973), EW) 

Oj is iid from G 
G is from DP^aGo) 

A crucial feature of our modelling style concerns the discreteness of the 
prior distribution G, given the assumption of Dirichlet process; that is, under 
these assumptions, the parameters 0^ are coincident with positive probability. 
In fact, this is the property that can be exploited to show that (1) boils down 
to the form 

[y. I 0m] = E ^.|^ exp |-i (y, - M ■)' A* (y, - m*) } (2) 

where {0*, . . . , 0*} are p distinct components in ©m with d* occuring Mj 
times, and ttj = MjjM. Hence, although our model is actually variable di- 
mensional, this is induced through the Dirichlet process prior, and does not 
involve complexities as in RJMCMC. In fact, we will derive an easily imple- 
mentable Gibbs sampling algorithm, even for highly multivariate observaions. 
Observe that, in sharp contrast to the proposed model of EW, the number 
of parameters to be simulated remains fixed (since the maximum number of 
mixture components is fixed), even though the number of observations, n, 
could be extremely large. 

Associated with the mixture model (1) is the idea of Bayesian curve- 
fitting. This we illustrate in the next section. 

3.1 Bayesian curve fitting 

In simplified notation, we write (1) as 

[y|®^] = ]^E^4y^M„A7^) (3) 

It follows that the conditional distribution of yi given = (y2, ■ ■ ■ , Z/d)' is 
given by 

1 ^ 

[y, I ©M, y_i] = ^Y. ^^-^ ■ ^-y' ^=y) X ^ (^1 : Mil.,.. aSII.,,) 

(4) 
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where jj}^^^ ^ and A!^'|2 a respectively, the univariate conditional mean 
E{yi I y_i,©M) and the inverse precision 1/V{yi \ y_i,0M) under the 
assumption that y is from Nd{tJ,j, Aj). The {d — 1) dimensional parameters 
lj,_ij,A_ij stand for Hj,Aj but without the first component. 

As a result, assuming k distinct components 01, ... ,61 in @m, and as- 
suming further that each distinct component d* occurs Mj times, we have, 

k 

E[yi I @M, y-i] = ^^'\y-ML.4 (5) 

3=1 

is a weighted sum of the component regression functions /x^jg where the 
associated weight w^^\y_i) is given by 



w^-'\y-i) is proportional to 

^^^-i(y-i^M-i.>A*:y (6) 

and the proportionality constant is chosen such that w^*^-'''(y_i) = 1- 

Note that the regression function estimator developed above is struc- 
turally quite different from that given by Muller(1996), who develop a re- 
gression estimator based on the model of EW. One clear advantage of our 
curve over that of Muller(1996) is that for massive data sets the curve-fitting 
idea of Muller(1996) can not be implemented due to extreme computational 
burden, while our curve (5) can be easily fitted to any data set, massive or 
not. 

Assuming that a sample |©m ; • • • > ^m"*! available from the posterior 
distribution of ©m (typically by MCMC), the marginalized curve E{yi \ y_i) 
is estimated as 

1 ^ 

E{y, I y_J = E{E{y, \ 0M,y_i)) «^ ^ E ^(^^ I ®2,y-i) (7) 

t=i 

Pointwise variability of the curve is measured by 

Var{yi \ y_,) = Var{E{y^ \ 0m, y_i)) + E{VaT{y^ \ ©m, y_i)) (8) 

The first component of the above variance is estimated by the sample variance 

and the second component is estimated 
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by the sample mean of [Var{yi | eg, y_i); i = 1, . . . , 7v}. Approximate 

100(1 — T)per cent pointwise credible intervals of the curve are given by E{yi \ 
y_i) ± A/^^ar(^7Ty^^, where z-r is the lOOr-th quantile of a standard 
normal distribution. 

Hence, once the MCMC realizations |0£\ . . . , are available, it 

is an easy task to obtain a Bayesian regression curve with all summaries 
readily available. In the next section we derive an extremely fast and easily 
implementable Gibbs sampling algorithm. 

4 Gibbs sampling implementation of the pro- 
posed model 

We assume that under Gq, 

[Aj] is from 

Wzshartd (^^, 1^ (9) 
[Hj I Aj] is from 

Ar4/Uo,M7') (10) 
Hence, the joint distribution of 6j is given by 

'SA/ 



s-d-l { (\ 

[Aj] [Mj I H = c| A,- 1 ^ exp I -^'^ ( - 



where 



c = 



d{d-\) a I s 
TT 4 — 2 
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4.1 Representation of the mixture using allocation vari- 
ables and associated full conditional distributions 

The distribution of [y^ | &m] given by (1) can be represented by introducing 
the allocation variables follows: 
For i — 1, . . . ,n and j — 1,. . M, 

[y^ I Zi = j, 0m] = 1^4^ cxp I - ^ (y^ - /i^.)' A^- (y^ - ^i^) \ (12) 

(27r)2 I ^ ) 

= (13) 

It follows that the full conditional distribution of the allocation variables 
Zi {i — 1, . . . ,n) given the rest is given by 
[zi — j \ Y, ©M, Z_j] is proportional to 

|^exp|-^(y,-M^.)'A,(y,-/xj|; j = l,...,M (14) 

4.2 Full conditionals of 6j 

Defining rij = H^{i '■ Zi = j} and y^ = ^j.^.^^ y^/'^j, we note that the full 
conditional distribution of Oj given the rest is given by 

M 

[0, |Y,Z,0_,M]=go,G,(0,)+ (15) 



Under Gj the distribution of dj is given by: 

^ - Is 

2 ' 2 I ' 71.-0 + 1 



\K 1 f w u + ( ^ + 1 Jc , ^^■(yj-/*o)(yj-Mo)' , - w 

[A,] from wishartd + — ^ , . + (y* " yi)(yi " y. 



(16) 



riji/j + 1 ' (rijV' + 
In (15) qoj and q^j are given by the following: 
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Qoj is proportional to a, where 




1 



2 



a 



a 



riji/j + 1 



X 



S+Uj+l—l 

2 



) 



(18) 



"j(yj-M())(yj-Mo)' 

Tljlp + l 





+ 



and, Qij is proportional to 




n 



i-Zi=j 



(19) 



The proportionality constant is chosen such that qoj + X^^^^ ^.^^ % = 1. 

It is useful to provide the intuition behind updating the allocation vari- 
ables Z and the parameters ©m- Given M distinct values of the parameter 
vector @M, the allocation vector Z clusters the n-dimensional observation 
vector Y into M*(< M) clusters of the form Uj = {i : Zi = j}; j = 1, . . . , M*. 
These can be thought of as the initial clusters, since the Dirichlet process 
prior acts upon {Ui, . . . ,Um*}: to yield k{< M*) distinct parameter values 
01,. . . ,61 out of the possible M* distinct values to yield the final cluster- 
ing, say, {Vi, .... Vk}, of {Ui, Um}, with Ve = Uj.c^=iUj. The clusters 
Ve are associated with the configuration vector C. Clearly, the clustering 
{Vi, . . . ,Vk} is coarser than {Ui, . . . ,Um} in the sense that the former con- 
sists of lesser number of blocks with more elements in each block. We note a 
computational advantage our method over the RJMCMC algorithm of RG. 
Note that empty components are naturally handled in our method; indeed, if 
the j'-th component is an empty component (which can happen if the alloca- 
tion variables do not allocate any observation to the j-th component), then 
the fact rij = {i : Zi = j} — occurs naturally in our model and no special 
care is necessary for validation of this step. But this situation requires an 
extra, careful, and complicated step in the method of RG. 



10 



4.3 Full conditional distribution of a 



This remains same as in the univariate case, which is given, for the prior 
Gamma{aa-i ha) on a, given the number of distinct components /c, and an- 
other continuous random variable 77, by 

[a I Y, Z, ©M, /c, Tj] from Tir^Gamma{aa + k,ba — log{ri)) 

+ (1 — 7r^)Gamma(aQ, + k — l,ba — log(?7))(20) 

where 

l-TTr, m{ha - \og{r])) 

The full conditional distribution of 77 given the rest is Beta{a + 1,M), 
that is, a Beta distribution with mean (a + 1)/(q; + M + 1). 



4.4 Full conditional distributions of /Xq and ip 

The distributions of the hyperparameters /j,q and ip are given by: The dis- 
tribution of [/Iq I Y, Z,©M,'0] is d-variate normal, with mean vector and 
dispersion matrix given by: 

E [mo I Y, Z, 0M, V'] = (^^^ + ^11 + j (22) 

y[/Uo|Y,Z,0M,^] = ^# + Aj]A*j AV^ (23) 

In the above, we have denoted the identity matrix by I. The full conditional 
distribution of psi is given by 
[i/j I Y, Z, &M: Mo] from 

Gamma ( ^ ^ ^ ) (24) 



In (24), = E-=i(M,* - Mo)'A*(m; - Mo)- 

We have thus derived a simple Gibbs sampling algorithm for our mix- 
ture model with unknown number of components, which is computationally 
highly suitable for massive data sets, thanks to the fixed maximum number 
of components. 
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5 Simulation study to illustrate the perfor- 
mance of our curve-fitting method 

We assume a bivariate normal distribution of two random variables (y, x) 
(that is, d = 2 in the general multivariate methodologies developed in Sec- 
tions 3 and 4), where the true regression function of y on x is x+sin x, a highly 
non-linear curve. Pretending that the true curve is unknown, and that all 
we have is a sample of 1000 observations (xj, yi);i = 1, . . . , 1000, we demon- 
strate that our curve-fitting idea can accurately estimate the (unknown) true 
curve. We obtain the data by actually simulating from the bivariate normal 
distribution. 

To implement our curve-fitting procedure, we need to fit the data using 
the Bayesian mixture model based on Dirichlet process. Some of the prior 
parameters are chosen such that fast convergence to the target posterior is 
ensured, and other choices (and justifications thereof) are motivated by those 
of EW, RG, and Bhattacharya(2008). For example, selecting Hq to be the 
sample mean vector, and S to be the sample dispersion matrix indicated 
good mixing properties of our Gibbs sampler. However, it is important to 
select the prior parameters of a carefully, since this can significantly affect 
the probability distribution of the number of components, and hence the fit 
of the curve. To select an appropriate prior for a, we first assume that it is a 
constant to be determined (by a procedure to be described below). Once it 
is determined, we select the prior parameters and ba such that the mode 
of the prior distribution of a, Gamma{aa, b^) is set equal to the determined 
value and the variance is as large as possible to refiect our vagueness about 
the prior. 

To determine the mode of the prior of a, we fit the Bayesian curve with 
many fixed values of a, and compute the maximum absolute difference at 
Xi,i = 1, . . . , 1000 between the true curve and the fitted curve E{y \ x). We 
choose that value of a as the prior mode for which the deviation is less than 
0.4 and the fitted curve contains most of the features of the true curve. 

Table 1 displays the maximum absolute deviations corresponding to a 
fixed value of a. To obtain each row of Table 1 we ran our Gibbs sampler 
for 20000 iterations, discarding the first 5000 iterations as burn-in. From the 
table we choose the value 25 as the mode of the prior distribution of a as 
the optimum choice. 

Hence, we fix the prior mode of Gamma{aa, ba), given by (oq — l)/6a = 25, 
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Figure 1: Bayesian curve fitting: the fitted curve (green) and the true curve 
(yellow) associated with the simulation study. The red curves denote point- 
wise 95 percent credible intervals. 
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Table 1: Two-way table showing the deviations of the fitted curve from the 
true curve 



Value of a 


Deviation 


0.5 


1.004 


1.0 


0.896 


5.0 


0.597 


10.0 


0.4898 


15.0 


0.4154 


25.0 


0.355 



so that = 256q, + 1. Now note that the variance of Gammaiaai ba) is 
o-a/ba = (25/6q,) + l/&^. Fixing ba = 0.1 yields a considerably large variance 
of 350. Hence, we fix ba = 0.1, which implies = 256^ + 1 = 3.5. 

The associated diagram Figure 1, which corresponds to the derived prior 
choice a from Gamma{3.5, 0.1) shows that the true regression function (yellow- 
coloured) is estimated quite accurately by the fitted Bayesian semiparametric 
curve (green-coloured) for this choice. Moreover, the pointwise 95 percent 
credible intervals (red-coloured) show that the entire true curve lies within 
the credible limits. This is very encouraging, given that the true model is 
highly non-linear. 

6 Application of the curve-fitting procedure 
to the real cosmological data set 

We now apply our methodologies to analyse the massive cosmological data 
set described in Sections 1 and 2. Recall that the data set consists of 96307 
bivariate data points on logarithm of redshift (z) and apparent magnitude 
(m) for Qsasars (qsasi-stellar objects) collected from SDSS data, and be- 
cause of the immense number of observations, it is absolutely impossible 
to implement the method of EW. The RJMCMC method of RG has been 
illustrated for univariate observations only, and even in that situation the 
procedure is overly complicated, and, in fact, had forced an error from the 
authors (the corrigendum has been provided in Richardson(1998). For bi- 
variate observations, as in our example, the RJMCMC algorithm proposed 
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by RG for univariate observations is not scalable for bivariate (or multivari- 
ate) observations without serious loss of efficiency. In sharp contrast to these 
popular methods, our methodologies, as we have shown, are easily and ef- 
ficiently scalable to any dimensionality, and quite importantly, is extremely 
fast, unlike the method of EW or other related ideas. Indeed, although it is 
impossible to implement the methods of EW in our real example, our Gibbs 
sampling algorithm completed 20,000 iterations in just about 10 hours; con- 
sidering the enormity of the number of observations, this indicates great 
efficiency. We discarded the first 5000 MCMC realizations as burn-in and 
stored the remaining 15,000 for inference. Informal convergence diagnostics 
indicated excellent mixing properties of our algorithm. A convergence diag- 
nostic method suited for semiparametric mixture models has been prescribed 
by Bhattacharya(2008); their method confirms excellent convergence in this 
example. 

In this real data situation, unfortunately, the true curve is unknown, 
hence we can not use exactly the same procedure as in the simulation study 
case to determine the prior of a. However, the concept of mixture models 
offers another interesting alternative, as detailed below. It is well-known 
that as the number of components in the mixture increases, closer is the 
approximation to the true curve. The price paid is the loss of parsimony of 
the model, however, we can forsake parsimony only for determining the prior 
of a, not for model-fitting. So, for our purpose, we first fit a mixture model 
to the cosmological data with a fixed (large) number of components. Since 
m = 30 was fixed as the maximum number of components in our Dirichlet 
process-based model, M = 30 is a natural choice for the mixture model 
with fixed, but large number components. The resulting Gibbs sampler is 
implemented by simulating the allocation variables from the full conditional 
distribution (14) but simulating the parameters dj from Gj, rather than from 
(15) for all iterations. 

The curve thus obtained can be taken as a close approximation to the 
"true" curve. We further increased the value of M to 50 but noted no sig- 
nificant deviation of the resuting curve from that corresponding to M = 30. 

We then applied the prior determining procedure in the case of «, as 
described in Section 5, given the "approximately true" curve as obtained 
by the above method. In other words, successively fixing a and noting the 
maximum absolute deviations of the fitted curves from the "approximately 
true" curve, we chose the appropriate value of a, which turned out to be 50 
in this real cosmological data case. As a consequence the prior on a is given 



15 



by Ga'mma{26, 0.5). 



6.1 Fitted Bayesian cosmological curve and change point 
analysis 

The fitted curve and the associated pointwise 95 percent credible intervals 
are shown in Figure 2 (due to difficulty detailed plot of data can not be 
shown); the green line represents the estimated Bayesian cosmological curve, 
and the pointwise 95 percent credible intervals arc shown in black colour. The 
difference in the nature of the lines in the same curve occurs due to variaton 
in the nature of red shift of the quasars of different ages. The number of 
different such quasars is reflected in the number of distinct components of the 
mixture model. The different distinct components of the mixture correspond 
to distributions of absolute magnitude for different ages of the clusters. The 
above discussion points towards a need for detailed cluster analysis of the 
data set, where not only the number of clusters, but the entire clustering is 
of interest to astro-physicists. 

The obtained non-linear curve is linear for the first half {z < —2.0) with 
intercept 18.7840 and slope 0.5136 (1.182608 with respect to logarithm with 
base 10, which is of interest to astro-physicists). After that, however, non- 
linearity is exhibited. But we also note that the form of non-linearity can be 
approximated by linear line segments indicating presence of change points. 
A close look will reveal presence of four change points, but to get a solid 
basis of belief, we performed a detailed change point analysis, assuming four 
change points. 

Although a Gibbs sampling algorithm is available on the similar lines of 
Carlin(1992), the algorithm is computationally expensive because of the mas- 
sive number of observations. Instead, we resort to the Metropolis-Hastings al- 
gorithm for simulating from the posterior. We omit details to save space, but 
remark that we achieved excellent convergence with our Metropolis-Hastings 
algorithm. 

Figure 2 also shows that the curve obtained by the change point analysis, 
which is shown in red colour, nicely approximates our fitted semiparametric 
Bayesian curve (the green curve) at all places except at the extreme lower 
end of the x-space, where there are hardly any information about the curve. 
Moreover, the entire change point curve falls within the (pointwise) 95 per- 
cent credible intervals associated with the semiparametric Bayesian curve. 
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Figure 2: the border of the most dense part of the data and some of the 
distant lying values are shown in blue, the fitted curve in green, the change 
point curve in red, and the pointwise 95 percent credible intervals are shown 
in black colour. i 7 



Apart from the semiparametric Bayesian curve and the change point 
curve, we also fitted the least squares regression line, obtained by assuming 
a simple linear regression of \og{z) on M. This linear regression is related 
to Hubble's law, and the implications of the slope of this straight line will 
discussed later in detail. For now we note that the least squares regres- 
sion line falls well within the pointwise 95 percent credible intervals of our 
Dirichlet process-based semiparameteric Bayesian curve. This shows that 
the linear regression, although not optimal (in the sense that normality as- 
sumption does not hold for this data set, for example), is not ruled out by 
our semiparametric method. 



6.2 Estimation of the densities of the observed data 
and goodness of fit check 

Note that the marginal densities oi y = m and x = log{z) can be esti- 
mated from our mixture model, given the MCMC-based posterior reahzations 

1©^^^; t — 1, . . . , A^j, for any X — x and Y — y, as 

1 ^ 

t=i 

N M 

= S^EE'V(-''S.1/a£') (25) 

t=l j=l 

and 

1 ^ 

t=i 

= iijjtt'^(y--f'ly^'h (26) 

t=i j=i 

Pointwise 95 percent credible intervals can be obtained for each of the marginal 
densities as in the case of Bayesian curve estimation. 

These Bayesian density estimates are useful for model validation pur- 
pose. In fact, these density estimates can be compared with the observed 
histograms of the individual variables of the observed data. A high degree of 
discrepancy between the observed histogram and the corresponding density 
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estimate will indicate lack of model fit. Figures 3 and 4 show the observed 
marginal histograms, the marginal density estimates, and the associated 95 
percent credible intervals of the true density. A few sample densities are 
also shown. Very clearly, the marginal density estimates fit the histogram 
very satisfactorily, leaving no reason to doubt the validity of our mixture 
model. In fact, the histograms (if smoothed by any means), the density esti- 
mates, and also the sample densities, all lie within their respective 95 percent 
credible intervals, which is very encouraging. 

7 Bayesian posterior distribution of cluster- 
ing 

As discussed in Section 6.1, it is of interest to astro-physicists to conduct a 
detailed study of the clusters of the data set. Thus, a methodology is needed 
which provides not only the posterior distribution of the number of clusters, 
but the posterior distribution of the clusterings, using which summaries of 
clusterings may be obained. We note that, our Gibbs sampling algorithm 
generates a clustering with varying number of clusters in each iteration (see 
Section 4.2; see also the Appendix, where the randomness of clusterings 
and the number of clusters is induced by the configuration vector C) The 
fact that even if the number of clusters are same in any two iterations, the 
corresponding clusterings are still different, shows that it is important to 
deal with the posterior distribution of clusterings rather than the posterior 
distribution of the number of clusters. Moreover, it is usually of scientific 
interest to analyse some representative of the clusterings produced by the 
Gibbs sampler, as in our cosmological example. 

It is to be noted that this problem is much more difficult as compared 
to summarization of posterior distribution of a parmeter. In the case of 
a parameter the posterior distribution can be summarized by its posterior 
mean or mode (analytical or sample-based). Similarly desired credible regions 
can be easily calculated. But it is not possible to take means of chistcrings 
produced. Due to continuity of the parameters the mean will give rise to 
a M-component clustering, though all the clusterings might consist of less 
than M clusters. Moreover the clusterings are permutation invariant. That 
is two clustering may be same except for a permutation of the components. 

Gonstruction of credible region poses even more difficulties. Here we use 
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Figure 3: Marginal density estimation of log(2;): the red lines represent the 
95 percent limits of the density, the green line stands for the fitted density 
and the blue lines represent sample densities. 
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Figure 4: Marginal density estimation of m: the red lines represent the 95 
percent limits of the density, the green line represents the fitted density and 
blue lines stand for sample densities. 
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a methodology introduced by Bhattacharya(2009) to tackle such difficulties. 

The methodology of Bhattacharya(2009) relies on an appropriately de- 
fined metric to compute distances between any two clusterings of a given 
data set. The metric was used to compute the posterior probability distribu- 
tion of clusterings, and to provide a "central" clustering and the associated 
credible regions. The authors apphed their methodology to analyse the pos- 
terior distribution of clusterings of a large vegetation data set obtained from 
the Western Ghats in India, generated by the method of EW. In this paper, 
we apply their cluster analysis methodology to the clusterings generated by 
our Bayesian mixture model. 

7.1 Definition of central clustering 

Guided by the definition of mode in the case of parametric distributions, given 
a suitable metric d to compute the distance between any two clusterings, 
Bhattacharya(2009) define a clustering C* as "central" if, for a given small 
e > 0, 

P {{C : d{C*, C) < e}) = sup P {{C : d{C' , C) < e}) (27) 

c 

Thus for a sufficiently small e > 0, the probability of an e-neighbourhood 
of an arbitrary clustering C is highest when C = C*, the central clustering. 
The above definition holds for all positive e if the distribution of clustering 
is unimodal. 

Otherwise the depending on e we will have different local modes of clus- 
tering, from among which the global mode is to be determined. 

7.2 Empirical Definition of Central Clustering 

We define that clustering C^^ as "approximately central" which, for a given 
small e > 0, satisfies the following equation 

CO) = arg max {C(^); 1 < A; < TV : (i(C», C^'^)) < e} (28) 

The central clustering C^^^ is easily computable, given e > and a suitable 
metric d. Also, by the ergodic theorem, as N ^ oo the empirical central 
clustering C^-'^ converges almost surely to the exact central clustering C*. 
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Given a central clustering C^^^ one can then obtain, say, an approximate 
95 percent highest posteror density credible region as the set {C^''^; 1 < k < N : d{C''^\ C^^^) < e*}, 
where e* is such that 

^# (C^*^); l<k<N: d{C^^\ C^^^) < e*] ^ 0.95 (29) 

In (29) e* must be chosen by trial and error. 



7.3 Choice of the metric d 

Two clusterings may not be very easily comparable as the cluster number of 
one may totally unrelated to the cluster numbers of the other. So, one way to 
compare them is to find a measure of divergence between them after permut- 
ing the arbitrary indices to make the two clusterings as close to each other 
as possible. Ghosh(2008) define the distance (i(/, //) between clusterings / 
and // as follows. 



d{I, II) = min[noo - {uij^ + ^2^2 + . . . + rifejJ]/noo (30) 

over all permutations (ji, J2, ■ ■ ■ , Jfc) of (1,2,..., /c), where k denotes the num- 
ber of clusters, n.y is the number of units belonging to the i-th cluster of / 
and j-th cluster of //, and noo = X] X] total number of units. 

For justification of the above idea, and for the proof that (30) satisfies the 
properties of a metric, see Ghosh(2008). 

However, computation of the above metric (30) requires the minima over 
all possible permutations of the clusters. If the number of clusters under 
consideration is large this leads to enormous computational burden. For 
MCMC iterations, one needs to compute the metric for a large number of 
clusterings (one for each iteration), and since each iteration may yield quite 
a large number of clusters, the calculation quickly becomes infeasible. To 
overcome this Bhattacharya(09) propose an approximation to (30) as 

(i(/,//) = max |(i(/, //),(!(//,/) I (31) 

where 

d{I,II) = I Woo - max^ njj | / ripo (32) 

^ ^ _ Ei=i niaxi<j<fc riij ^^^^ 
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Very clearly, no computational labour is required to compute d{I,II). 
Very importantly, Bhattacharya(09) demonstrate that d provides very ac- 
curate approximations to the original metric d. Moreover, it is easy to see 
that d satisfies first three properties of a metric. The fourth property can 
be seen to be valid when the clusterings are independent. But no counter 
example has been so far come across. So Bhattacharya(09) conjecture that 
d is a metric. As a result, for our analysis we will always use d instead of d. 

8 Application of the clustering idea to the 
cosmology data set 

On application of the central clustering ideas, we observe that for different 
range of values of e > we have different central clusterings, clearly indicating 
multimodahty of the posterior distribution of clusterings. For < e < 0.05 
the central clustering is 1138-th clustering after considering burn-in. For 
0.05 < e < 0.1 it is 4341-th; for 0.1 < e < 0.3 it is 4849-th; for 0.3 < e < 0.5 
the number is 570-th clustering after considering burn-in etc. Following the 
technique Bhattacharya(09) applied for obtaining the global central cluster- 
ing, we obtain the clustering corresponding to iteration number 1137 as the 
global central clustering. The radius of 95 percent credible region of the 
global mode is 0.35, which is reasonably low. 

We note that the central clustering in our case consists of 29 clusters. 
This is quite reasonable, given that there are more than 96,000 observations. 
Moreover, we note that although there are 29 clusters, many are effectively 
the same cluster, thanks to the small Euchdean distances between them. 
This reduction of the effective number of clusters finds reasons more than 
statistical within the astro-physics paradigm. Indeed, astro-physicists (for 
example, Roy (2007)) have tried to split the data into 2 characteristics only, 
namely Radio loud and Radio quiet. 

Driven by the above observations and discussions, we merge those clus- 
ters with Euclidean distances less than a prefixed limit. Table 2 shows how 
the number of clusters change if the prefixed limit is changed. The merged 
central clustering consisting of two components only (which corresponds to 
the prefixed limit being 0.9) is shown in Figure 5. This is provided to make 
our analysis comparable to the clustering done by astro-physicists on the 
basis of Radio loud and Radio quiet. 
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Figure 5: Merged central clustering with 2 clusters: the differently marked 
parts indicate two different clusters. 
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Table 2: Table showing the variation in the number clusters with change in 
the prefixed limit 



Value of prefixed limit 


Number of clusters after merger 


0.05 


23 


0.1 


21 


0.3 


10 


0.5 


9 


0.65 


5 


0.7 


4 


0.9 


2 



We have repeated the same analysis taking the maximum number of com- 
ponents, M = 50. No notable difference between the results of the two anal- 
yses were observed. Indeed, even with M = 30, the posterior probability of 
30 components turned out to be negligible (about 0.001). 

9 Possible implications of the results of sta- 
tistical analysis 

Our statistical analysis of the SDSS data has a number of implications that 
may give answer to many interesting questions from view point of quasar 
astronomy and cosmological models. 

• The curve is linear for the small values of z and becomes nonlinear 
for high values. It is to be noted that for low redshift z the curve is 
linear with gradient of 0.5136 (1.182608 when logarithm of z is taken 
in log base 10 while fitting the curve) and intercept 18.7840. In case 
of standard cosmological model, the gradient of the Hubble line is sup- 
posed to he between 4.8 and 5.3 Efron(92). One of the present authors 
SR Roy(07) analysed the quasar data using non-parametric methods 
developed by Efron(92) both for Veron Getty as well as for SDSS DR-3 
data and found also linearity for small z and non-linearity for high z. 

• The whole curve can be approximated by five line segments with four 
change points. 
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• We performed a detail cluster analysis that gives rise to possible number 

of clusters as 29. But from observing values of the components it is 
evident that the clusters can be merged further to fewer number of 
clusters. The degree of reduction depends upon the prefixed threshold 
for merging the clusters. 

• The merged clusters can be compared with the clusterings observed by 
astronomers Porciani(2006) for example, redshift dependent clusters or 
luminosity dependent clusters. This will be discussed elaborately in 
future communications. 

• There exists two broad class of quasars like radio-loud and radio quiet 
quasars. The environments around these clusters of quasars are differ- 
ent. The width and other characteristics of the emission or absorption 
lines from these quasars will be affected depending on the nature of the 
environments. 

• In our framework, we have merged the clusters into two broad cluster- 
ings imder certain thresholds. The characteristics of these clusterings 
need to be investigated in details so as to compare with the radio loud 
or radio quiet quasars which will be done in subsequent publications. 

The non-linearity of the curve may be due to several factors like evolution 
of the quasars, acceleration or deceleration of the universe. Our findings will 
shed new light not only on the validity of Hubble law but also help us to 
estimate the acceleration/deceleration parameters. These issues are very 
much important from the point of view of cosmological debates and will be 
considered in details in subsequent publications. 

Appendix 

A Reparameterization using configuration in- 
dicators 

Let kj denote the number of distinct values in 0_jM, and let ; £ = 1, . . . , kj 
denote the distinct values. Also suppose that 0^ occurs M^j times. Then a 
reparameterization of our model parameters can be devised as follows. 
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As in Muller(96), we introduce the configuration vector C = {ci, . . . , cm}, 
where Cj = £ if and only if 6j = 6*f>;j,i = 1,...,M. The configuration 
vector C thus provides a reparameterization of the original parameters, the 
latter being reparameterized into distinct components and the associated 
configuration vector. Using this reparameterized version one can can avoid 
simulation of all the parameters corresponding to all m components. In 
fact, once a configuration is simulated, only the distinct parameters may be 
simulated. Moreover, the corresponding Gibbs sampler may have superior 
convergence properties (see MacEachern(94)). 

A.l Full conditional distributions of the distinct values 

of Qrn 

The conditional posterior distribution of 6} is given by 
[6>; I Y, Z, C] from 

Wishart, (A| : s}, S}) x N, : /u*, ^1^*,-') (34) 

In the above, n} = J2r.c,=e^j^ Te = Ei:c,=£%yi/ Ej:s,=£^i> 4 = ^> S| = 

1 |s + nKA^o^g^o-yP' + E,.,=.".(y, - y.*)(y, - y.*)' + E,:e,=. E..,=,(y. - y,)(y. - y,)'}, 

and ip} = ipl {ipnl + 1). It is to be noted that the 6^ are conditionally inde- 
pendent. 

A. 2 Full conditional distributions of the configuration 
indicators Cj 

The conditional distributions of Cj are given, in the multivariate case, by 
[cj = £ I Y, Z, C_j, ©m] is proportional to 

r if £ = !,..., A;, 
\ qoj if i ^ kj + 1 

where qoj is the expression given by (18), and 
q^j is proportional to 



(35) 
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ia:i 2 



(27r) 



I i-Zi=j 



Note that it is possible to replace ^oj in (18) with 



« ^exp 

(27r)^ 



i:Zi=j 



where 0* = {fi*, A*)fromGo. The latter formulation is most appropriate 
when Go is not conjugate to the likelihood, which may preclude integration 
of (37) with respect to Go, making the explicit form of qoj intractable. In 
our case, we can also integrate q^j with respect to the conditional posterior 
distribution of 0^ given by (34) to obtain 



C = Me, 



1\ 2 



ipn*(, + 1 



V^n^ + iprij + 1 



X 



n 



X 



(1) 



1 r 



.(2) 



(37) 



In (37) 



S + 



^Kmo -y£)(Mo -y^)' 



ipn} + 1 



XI ^j(yj -yl)(yj-yl)' 



(38) 



j:sj=ii:zi=j 



and 



S + 



^Kmo -y^)(Mo -y^)' 



+ (yi-yj)(yi-yi)' 



ipn} + 1 



j:sj=ti:zi=j 



+ 



nAifj - Mo) + ^n}{yj - TeMYj - /^o) + ^n}{yj - y})]' 



{i/jn} + l)(t/'n| + i/jUj + 1) 



(39) 
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